QVAC-3697: Load GGUF File From Buffer #1

Open
jesusmb1995 wants to merge 24 commits into base: temp-load-from-buffer from jmb/memory_load_pr

Conversation


@jesusmb1995 commented Jul 30, 2025

This pull request changes llama.cpp so that models can be loaded directly from memory. It is intended to be reviewed commit by commit; each commit contains a longer description below its header.

Tested to work properly from a bare Addon (LLM repo). See #1 (comment)

In particular, this PR exposes:

  • llama-cpp.h:llama_model_load_from_buffer(vector<uint8_t>&& data, ...) to load a model from a single buffer containing the contents of a .gguf file (a usage sketch follows this list).
  • llama.h:llama_model_load_from_split_futures(char** paths, ...) and llama-cpp.h:llama_model_load_fulfill_split_future(char* path, ..., unique_ptr<basic_streambuf<uint8_t>>&& streambuf), which allow a model to be loaded asynchronously/incrementally, uploading its tensors to backend storage while host memory is being released.
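As a rough usage sketch (not part of the PR): loading from a single buffer might look like the following, assuming default model params. The exact signature, return type, and additional parameters of llama_model_load_from_buffer are defined in llama-cpp.h in this PR; everything beyond the buffer argument here is an assumption.

```cpp
// Sketch only: llama_model_default_params()/llama_model_free() are existing
// llama.cpp API; the extra parameters and return type of
// llama_model_load_from_buffer are assumptions based on the description above.
#include <cstdint>
#include <cstdio>
#include <fstream>
#include <iterator>
#include <vector>

#include "llama-cpp.h"

int main() {
    // In the Addon this buffer would come from a dataloader; here it is simply
    // the whole .gguf file read from disk into memory.
    std::ifstream in("models/qwen3/Qwen3-0.6B-Q8_0.gguf", std::ios::binary);
    std::vector<uint8_t> data((std::istreambuf_iterator<char>(in)),
                               std::istreambuf_iterator<char>());

    llama_model_params params = llama_model_default_params();
    llama_model * model = llama_model_load_from_buffer(std::move(data), params);
    if (model == nullptr) {
        fprintf(stderr, "failed to load model from buffer\n");
        return 1;
    }

    llama_model_free(model);
    return 0;
}
```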

How to run the code?

Build and prepare model

Build llama.cpp (e.g. in Release mode) including the examples, tests, and tools:

cmake -B build -DCMAKE_BUILD_TYPE=Release -DLLAMA_BUILD_TOOLS=ON -DLLAMA_BUILD_TESTS=ON -DLLAMA_BUILD_EXAMPLES=ON -DGGML_VULKAN=ON && cmake --build build

Generate a sharded model and its *.tensor.txt summary file:

./build/bin/llama-gguf-split --split --split-max-size 300M models/qwen3/Qwen3-0.6B-Q8_0.gguf Qwen3-0.6B-Q8_0 &&
 mv Qwen*.* models/qwen3

Automated tests

Run automated tests for a single gguf file:

cd build
export LLAMACPP_TEST_MODELFILE=../models/qwen3/Qwen3-0.6B-Q8_0.gguf
ctest -R ^test-model-load-disk$ --verbose
ctest -R ^test-model-load-memory$ --verbose

Run automated tests for sharded model:

cd build
export LLAMACPP_TEST_MODELFILE=../models/qwen3/Qwen3-0.6B-Q8_0-00001-of-00010.gguf
ctest -R ^test-model-load-disk$ --verbose
ctest -R ^test-model-load-memory-split$ --verbose

Or simply run all tests:

cd build
export LLAMACPP_TEST_MODELFILE=../models/qwen3/Qwen3-0.6B-Q8_0.gguf
ctest

Should output:

...
30/41 Test #30: test-backend-ops ..................   Passed  104.24 sec                                                     
      Start 31: test-model-load-cancel                        
31/41 Test #31: test-model-load-cancel ............   Passed    0.34 sec                                                     
      Start 32: test-model-load-disk                          
32/41 Test #32: test-model-load-disk ..............   Passed    0.43 sec                                                     
      Start 33: test-model-load-memory                        
33/41 Test #33: test-model-load-memory ............   Passed    0.00 sec                                                     
      Start 34: test-model-load-memory-split                  
34/41 Test #34: test-model-load-memory-split ......   Passed    0.67 sec 
...
41/41 Test #41: test-eval-callback ................   Passed    0.84 sec

100% tests passed, 0 tests failed out of 41

Label Time Summary:
curl             =   0.84 sec*proc (1 test)
eval-callback    =   0.84 sec*proc (1 test)
main             = 136.15 sec*proc (35 tests)
model            =   1.79 sec*proc (5 tests)

Examples

Demo video: https://drive.google.com/file/d/1mjqecwJ1LFYUNofr4wIdPFK9IkUxbHZh/view?usp=sharing

Set up the environment:

# Do not export any variable to load from disk
# export LLAMA_EXAMPLE_MEMORY_BUFFER=1
export LLAMA_EXAMPLE_MEMORY_BUFFER_SPLIT=1

# Alternatively point GGUF_PATH at a single .gguf file and set LLAMA_EXAMPLE_MEMORY_BUFFER=1
export GGUF_PATH="models/qwen3/Qwen3-0.6B-Q8_0-00001-of-00010.gguf"

Run example with Qwen3:

/usr/bin/time -v ./build/bin/llama-simple -m "$GGUF_PATH"

Outputs:

...
print_backend_buffers_info: offloading 28 repeating layers to GPU
print_backend_buffers_info: offloading output layer to GPU
print_backend_buffers_info: offloaded 29/29 layers to GPU
print_backend_buffers_info:      Vulkan0 model buffer size =   199.11 MiB
print_backend_buffers_info:  Vulkan_Host model buffer size =   157.65 MiB
print_backend_buffers_info:      Vulkan0 model buffer size =    44.65 MiB
print_backend_buffers_info:      Vulkan0 model buffer size =    46.78 MiB
print_backend_buffers_info:      Vulkan0 model buffer size =    47.84 MiB
print_backend_buffers_info:      Vulkan0 model buffer size =    45.71 MiB
print_backend_buffers_info:      Vulkan0 model buffer size =    45.71 MiB
print_backend_buffers_info:      Vulkan0 model buffer size =    47.83 MiB
print_backend_buffers_info:      Vulkan0 model buffer size =    47.84 MiB
print_backend_buffers_info:      Vulkan0 model buffer size =    46.78 MiB
print_backend_buffers_info:      Vulkan0 model buffer size =    31.89 MiB
llama_context: constructing llama_context
llama_context: n_batch is less than GGML_KQ_MASK_PAD - increasing to 64
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 35
llama_context: n_ctx_per_seq = 35
llama_context: n_batch       = 64
llama_context: n_ubatch      = 64
llama_context: causal_attn   = 1
llama_context: flash_attn    = 0
llama_context: freq_base     = 1000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (35) < n_ctx_train (40960) -- the full capacity of the model will not be utilized
set_abort_callback: call
llama_context: Vulkan_Host  output buffer size =     0.58 MiB
create_memory: n_ctx = 64 (padded)
llama_kv_cache_unified: layer   0: dev = Vulkan0
llama_kv_cache_unified: layer   1: dev = Vulkan0
llama_kv_cache_unified: layer   2: dev = Vulkan0
llama_kv_cache_unified: layer   3: dev = Vulkan0
llama_kv_cache_unified: layer   4: dev = Vulkan0
llama_kv_cache_unified: layer   5: dev = Vulkan0
llama_kv_cache_unified: layer   6: dev = Vulkan0
llama_kv_cache_unified: layer   7: dev = Vulkan0
llama_kv_cache_unified: layer   8: dev = Vulkan0
llama_kv_cache_unified: layer   9: dev = Vulkan0
llama_kv_cache_unified: layer  10: dev = Vulkan0
llama_kv_cache_unified: layer  11: dev = Vulkan0
llama_kv_cache_unified: layer  12: dev = Vulkan0
llama_kv_cache_unified: layer  13: dev = Vulkan0
llama_kv_cache_unified: layer  14: dev = Vulkan0
llama_kv_cache_unified: layer  15: dev = Vulkan0
llama_kv_cache_unified: layer  16: dev = Vulkan0
llama_kv_cache_unified: layer  17: dev = Vulkan0
llama_kv_cache_unified: layer  18: dev = Vulkan0
llama_kv_cache_unified: layer  19: dev = Vulkan0
llama_kv_cache_unified: layer  20: dev = Vulkan0
llama_kv_cache_unified: layer  21: dev = Vulkan0
llama_kv_cache_unified: layer  22: dev = Vulkan0
llama_kv_cache_unified: layer  23: dev = Vulkan0
llama_kv_cache_unified: layer  24: dev = Vulkan0
llama_kv_cache_unified: layer  25: dev = Vulkan0
llama_kv_cache_unified: layer  26: dev = Vulkan0
llama_kv_cache_unified: layer  27: dev = Vulkan0
llama_kv_cache_unified:    Vulkan0 KV buffer size =     7.00 MiB
llama_kv_cache_unified: size =    7.00 MiB (    64 cells,  28 layers,  1 seqs), K (f16):    3.50 MiB, V (f16):    3.50 MiB
llama_context: enumerating backends
llama_context: backend_ptrs.size() = 2
llama_context: max_nodes = 65536
llama_context: worst-case: n_tokens = 64, n_seqs = 1, n_outputs = 0
graph_reserve: reserving a graph for ubatch with n_tokens =   64, n_seqs =  1, n_outputs =   64
graph_reserve: reserving a graph for ubatch with n_tokens =    1, n_seqs =  1, n_outputs =    1
graph_reserve: reserving a graph for ubatch with n_tokens =   64, n_seqs =  1, n_outputs =   64
llama_context:    Vulkan0 compute buffer size =    37.34 MiB
llama_context: Vulkan_Host compute buffer size =     0.27 MiB
llama_context: graph nodes  = 1126
llama_context: graph splits = 2
Hello my name is Emily. I'm a student in the 10th grade. I'm interested in studying in the field of mathematics. I want to know how to study
main: decoded 32 tokens in 0.18 s, speed: 174.70 t/s

llama_perf_sampler_print:    sampling time =       2.62 ms /    32 runs   (    0.08 ms per token, 12195.12 tokens per second)
llama_perf_context_print:        load time =     402.14 ms
llama_perf_context_print: prompt eval time =      10.13 ms /     4 tokens (    2.53 ms per token,   394.91 tokens per second)
llama_perf_context_print:        eval time =     166.08 ms /    31 runs   (    5.36 ms per token,   186.65 tokens per second)
llama_perf_context_print:       total time =     575.19 ms /    35 tokens

	Command being timed: "./build/bin/llama-simple -m models/qwen3/Qwen3-0.6B-Q8_0-00001-of-00010.gguf"
	User time (seconds): 0.37
	System time (seconds): 0.44
	Percent of CPU this job got: 88%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.93
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 1101056
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 225849
	Voluntary context switches: 796
	Involuntary context switches: 15
	Swaps: 0
	File system inputs: 0
	File system outputs: 32
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

Run example with GTE:

# GGUF_PATH points to gte-large.Q2_K-00001-of-00003.gguf, for example.
/usr/bin/time -v ./build/bin/llama-embedding --model "$GGUF_PATH" --ngl 999

Related PRs


Asana task: https://app.asana.com/1/45238840754660/project/1210873391319186/task/1210877463428607


Convert llama_file to a pure virtual class that can be overridden by multiple implementations (disk, single memory buffer, ...); a rough sketch of the resulting interface is shown after these commit notes.
Define a new macro LLAMA_LOG_CMAKE_DEBUG that becomes a no-op in release builds. This provides good tracing and debugging capabilities, which will be especially useful for the asynchronous loading of multiple model shards.
Add an additional automated test that loads from disk, to ensure the existing functionality does not break.
The gguf-split utility now generates a `.txt` file listing all tensors. This is useful both for manual inspection/debugging and for incremental tensor loading, where it is not possible to know which tensors are present in other split files (this information is critical for handling optional tensors).
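For orientation, a rough sketch of what such a pure virtual llama_file interface could look like. The method set and names here are assumptions based on what a GGUF loader needs; the actual interface is defined in the commit.

```cpp
// Illustrative sketch only: the real interface lives in the commit; the exact
// methods and their signatures are assumptions.
#include <cstddef>
#include <cstdint>

struct llama_file {
    virtual ~llama_file() = default;

    virtual size_t   size() const = 0;                     // total size in bytes
    virtual size_t   tell() const = 0;                     // current read position
    virtual void     seek(size_t offset, int whence) = 0;  // move the read position
    virtual void     read_raw(void * dst, size_t len) = 0; // copy len bytes into dst
    virtual uint32_t read_u32() = 0;                       // convenience helper used by the loader
};

// Concrete implementations then cover the different sources, e.g.:
//   - a disk-backed file (FILE*/mmap, as before)
//   - a single in-memory buffer (std::vector<uint8_t>)
```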
@jesusmb1995 marked this pull request as draft July 30, 2025 18:24
@jesusmb1995 (Author)

I seem to lack permissions to add reviewers. The PR stays in draft until I test it on a bare Addon, but review of the llama.cpp C++ code can start: @olyasir @olek-tether @gianni-cor @chetasr @yuranich @jpgaribotti

@jesusmb1995 force-pushed the jmb/memory_load_pr branch 2 times, most recently from 02227e3 to 0718c30 on July 30, 2025 20:16
@jesusmb1995 (Author)

Updated tests to automatically skip based on the gguf filename (sharded or not) when running all tests at once.

@jesusmb1995 force-pushed the jmb/memory_load_pr branch 2 times, most recently from 5df4e25 to 52ed642 on July 30, 2025 20:49
@jesusmb1995 self-assigned this Aug 14, 2025
@jesusmb1995 marked this pull request as ready for review August 14, 2025 15:08
@jesusmb1995 (Author)

jesusmb1995 commented Aug 14, 2025

Un-drafting, since I was able to run the JS integration test for the Qwen3 LLM Addon without problems. The test can now use any dataloader implementation and will incrementally load the llama.cpp model. See the successful log below.

log_integration.txt

@jesusmb1995 requested a review from chetasr August 14, 2025 15:14
@jpgaribotti

We should not merge to master; it will make maintaining the fork more difficult. For example, we currently have another PR to merge from upstream to bring the fork up to date. We should create a differently named branch for our changes to the fork.

@yuranich

> We should not merge to master; it will make maintaining the fork more difficult. For example, we currently have another PR to merge from upstream to bring the fork up to date. We should create a differently named branch for our changes to the fork.

Can we do the following:

  1. finish updating from upstream
  2. create new branch, merge this fix there
  3. try to contribute back to upstream

Is that something we can do? I also saw there is a multimodal branch; is that something we can consider contributing back? @jpgaribotti

@jesusmb1995 (Author)

jesusmb1995 commented Aug 18, 2025

Fine with me. Please create a tether branch to merge the changes into, @yuranich.

> 3. try to contribute back to upstream
>    is that something we can do?

I have a task in the Asana project to do this, but I don't know how easy it will be given the amount of changes. Maybe we can merge some of the commits.

@jesusmb1995 changed the title from "Load GGUF File From Buffer" to "QVAC-3697: Load GGUF File From Buffer" Aug 18, 2025
@olek-tether self-requested a review August 18, 2025 20:26
@yuranich

> Fine with me. Please create a tether branch to merge the changes into, @yuranich.
>
> > 3. try to contribute back to upstream
> >    is that something we can do?
>
> I have a task in the Asana project to do this, but I don't know how easy it will be given the amount of changes. Maybe we can merge some of the commits.

temp-load-from-buffer created, @jesusmb1995

@jesusmb1995 changed the base branch from master to temp-load-from-buffer August 19, 2025 07:30
@jesusmb1995 (Author)

Force-pushed to attempt to fix CI on some platforms; due to different compilers/configs it was failing on some of them.

@jesusmb1995 force-pushed the jmb/memory_load_pr branch 7 times, most recently from 4277f06 to 4d263be on August 21, 2025 10:44
- Ensures a char traits implementation for uint8_t exists that can be used with std::basic_streambuf.
- Adds an implementation of std::basic_streambuf over a single vector. It will be used by llama.cpp and tests when loading from a single memory buffer (a rough sketch of the idea follows below).
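A minimal sketch of that idea, assuming hypothetical class names; the PR's actual traits class and streambuf implementation live in the commit.

```cpp
// Sketch only: names are illustrative, not the PR's actual identifiers.
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <cwchar>     // std::mbstate_t
#include <streambuf>
#include <vector>

// char_traits-style traits so std::basic_streambuf can be instantiated for uint8_t.
struct uint8_traits {
    using char_type  = uint8_t;
    using int_type   = int;
    using off_type   = std::streamoff;
    using pos_type   = std::streampos;
    using state_type = std::mbstate_t;

    static void assign(char_type & a, const char_type & b) { a = b; }
    static bool eq(char_type a, char_type b) { return a == b; }
    static bool lt(char_type a, char_type b) { return a < b; }
    static int  compare(const char_type * a, const char_type * b, size_t n) { return std::memcmp(a, b, n); }
    static size_t length(const char_type * s) { return std::strlen(reinterpret_cast<const char *>(s)); }
    static const char_type * find(const char_type * s, size_t n, const char_type & c) {
        return static_cast<const char_type *>(std::memchr(s, c, n));
    }
    static char_type * move(char_type * d, const char_type * s, size_t n) { return static_cast<char_type *>(std::memmove(d, s, n)); }
    static char_type * copy(char_type * d, const char_type * s, size_t n) { return static_cast<char_type *>(std::memcpy(d, s, n)); }
    static char_type to_char_type(int_type i) { return static_cast<char_type>(i); }
    static int_type  to_int_type(char_type c) { return c; }
    static bool      eq_int_type(int_type a, int_type b) { return a == b; }
    static int_type  eof() { return -1; }
    static int_type  not_eof(int_type i) { return i == eof() ? 0 : i; }
};

// Read-only streambuf exposing one in-memory vector as the get area.
class vector_streambuf : public std::basic_streambuf<uint8_t, uint8_traits> {
public:
    explicit vector_streambuf(std::vector<uint8_t> data) : data_(std::move(data)) {
        uint8_t * base = data_.data();
        setg(base, base, base + data_.size());
    }
private:
    std::vector<uint8_t> data_;
};
```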
Override the pure virtual interface with a class that can operate on a single memory buffer.
Auxiliary function to convert a list of C strings to a vector of C++ strings.
Add new GGUF reader implementation that can read metadata from a memory buffer.
- Add code to load a gguf file from a variant (memory or disk).
- Add some structs that simplify loading a file and keeping track of the related pointers (which are now in the same struct).
Move the loader code that processes a file after it has been loaded into memory and populates the loader's own attributes into a reusable method.
Add a new C++ function to the llama.cpp main header to load from a single memory buffer, and propagate the changes to internal calls/constructors.
A file buffer that can be fulfilled using string keys; the extract method blocks until the file is provided (a rough sketch of this pattern follows below).
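A minimal sketch of that fulfill/extract pattern, assuming hypothetical names and a plain std::streambuf (the PR itself works with basic_streambuf<uint8_t>):

```cpp
// Illustrative sketch of the fulfill/extract pattern described above; names and
// types are assumptions, not the PR's actual classes.
#include <condition_variable>
#include <map>
#include <memory>
#include <mutex>
#include <streambuf>
#include <string>

class split_future_buffer {
public:
    // Producer side: hand over the stream for a given split path.
    void fulfill(const std::string & path, std::unique_ptr<std::streambuf> buf) {
        std::lock_guard<std::mutex> lock(mutex_);
        buffers_[path] = std::move(buf);
        cv_.notify_all();
    }

    // Consumer side: block until the split identified by `path` has been provided.
    std::unique_ptr<std::streambuf> extract(const std::string & path) {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [&] { return buffers_.count(path) != 0; });
        auto buf = std::move(buffers_[path]);
        buffers_.erase(path);
        return buf;
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    std::map<std::string, std::unique_ptr<std::streambuf>> buffers_;
};
```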
Handles the logic for incrementally loading files and tensors in model shards.
Refactor backend buffer creation (for model loading) into functions.
- The function now takes size_data instead of the member attribute.
- Add sanity checks of file pointer handles.

These two changes will be useful when calling `load_all_data` multiple times during incremental shard loading.
Adapt the loader and model load to incrementally load files and upload tensors.
Add functions to Llama.cpp public headers to asynchronously load shards.
Split out some common loading functionality. This will help with the memory-loading tests.
Add a submodule with reusable code for tests.
Adapt the embedding example to showcase how to load from memory. It can be configured through environment variables.
Adapt the simple example to showcase how to load from memory. It can be configured with environment variables; a sketch of the environment-variable selection follows below.

Qwen3, for example, can be used with the simple example.
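A rough sketch of how the examples presumably pick the load path from the environment variables used earlier in this description; the helper below is hypothetical, the real selection logic lives in the adapted examples.

```cpp
// Hypothetical helper: mirrors the LLAMA_EXAMPLE_MEMORY_BUFFER[_SPLIT] variables
// shown in the "Examples" section; the real selection logic is in the examples.
#include <cstdlib>
#include <cstring>

enum class load_mode { disk, memory_buffer, memory_buffer_split };

static load_mode pick_load_mode() {
    const char * single = std::getenv("LLAMA_EXAMPLE_MEMORY_BUFFER");
    const char * split  = std::getenv("LLAMA_EXAMPLE_MEMORY_BUFFER_SPLIT");
    if (split  != nullptr && std::strcmp(split,  "1") == 0) return load_mode::memory_buffer_split;
    if (single != nullptr && std::strcmp(single, "1") == 0) return load_mode::memory_buffer;
    return load_mode::disk;  // no variable set: keep the original load-from-disk path
}
```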
Add some automated tests that load from memory (single buffer or multiple async splits).
@jesusmb1995 (Author)

Most CI pipelines pass now. Some target failures seem unrelated.

@jesusmb1995 (Author)

> Most CI pipelines pass now. Some target failures seem unrelated.

@jpgaribotti @yuranich Can you suggest what to do with the remaining failing CI pipelines? They seem to be due to unrelated issues, for example:

Run ARTIFACTS_JSON=$(curl -s -L \
Finding latest macos-latest-Release artifact...
No suitable Dawn artifact found!

Is it okay to proceed with the review as it is?
